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multiple constituent species and small systems in the grand canonical 
ensemble 
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In this paper we generalize a methodology [T. E. Ouldridge, A. A. Louis, and J. P. K. Doye, J. Phys.: 
Condons. Matter 22, 104102 (2010)] for dealing with the inference of bulk properties from small simulations 
of self- assembling systems of characteristic finite size. In particular, schemes for extrapolating the results of 
simulations of a single self-assembling object to the bulk limit are established in three cases: for assembly 
involving multiple particle species, for systems with one species localized in space and for simulations in the 
grand canonical ensemble. Furthermore, methodologies are introduced for evaluating the accuracy of these 
extrapolations. Example systems demonstrate that differences in cluster concentrations between simulations 
of a single self-assembling structure and bulk studies of the same model under identical conditions can be 
large, and that convergence on bulk results as system size is increased can be slow and non-trivial. 

PACS numbers: 87.14.G, 87.14.E, 87.19.Pp 



I. INTRODUCTION 

Self-assembly of monomer units into clusters of char- 
acteristic finite size is a central theme of biological 
and soft-matter systems. Examples include the forma- 
tion of spherical micelles?^ the self-assembly of virus 
capsids^r— the hybridization of DNA-i 2 - - — and the for- 
mation of protein complexes.— ~— With increased com- 
puting power and improved simulation techniques, it 
has become possible to simulate mesoscale models that 
reproduce such self-assembling behaviour. In recent 
years mcsoscopic models have been used to assemble 
micelles^ - — vesicles ; 29 ' 32 ' 33 hollow shells of specific sym- 
metry analogous to virus capsids 3 -^— and aggregates of 
particles which resemble protein clusters^ 3 - Additionally, 
reflecting the growth of DNA nanotechnology^ many 
coarse-grained models of DNA assembly have recently 
been proposed^ - — 

In many cases, these simulations report quantitative 
measures of assembly, such as the fraction of particles in- 
volved in clusters of a certain size. Experiments and ther- 
modynamic theories typically involve bulk systems with a 
large number of particles, capable of forming many target 
structures. So, ideally, simulations should be performed 
on large systems from which the concentrations of various 
cluster sizes can be directly extracted. In many cases, 
however, it is not practical to simulate a large system, 
particularly when a large free-energy barrier suppresses 
equilibration. Such free-energy barriers often arise when 
monomers interact in a complex fashion (such as in DNA 
self-assembly), or when many monomers must coopera- 
tively interact to create a stable target structure. Tech- 
niques such as umbrella sampling^ allow systems to equi- 
librate despite these barriers, but they are not suited to 
biasing the formation of a large number of targets. 

As a consequence, it is common practice to simu- 



late the assembly of a single target structure and at- 
tempt to infer bulk properties from the small-system 
data^-^ 3 ^ 3 ^-^^--^!^ Que obvious drawback 
of this approach is that any interactions between clusters 
(or tendencies to aggregate into macroscopic objects) are 
not observed. In many cases, however, monomers are in 
dilute solution and interactions between correctly formed 
targets are largely short-ranged and repulsive, and so 
these effects may be negligible. 

If inter-target interactions are indeed negligible, can 
the average density (yield) of clusters in a small simu- 
lation be taken directly as the yield in a bulk system of 
the same total concentration? Not if simulations are per- 
formed in the canonical ensemble. Although the average 
density of particles is the same as in bulk, local fluctu- 
ations are not captured. For example, figure Q] demon- 
strates that the statistics of an eight-particle system in 
volume 2v are not accurately captured by those of a four- 
particle system in volume v. 

To emphasize the scale of the errors that can arise from 
assuming that the yield of clusters in a single-target sim- 
ulation corresponds to the yield that would be measured 
for the same model in bulk, we consider a toy example 
of cooperative hexamer formation. Let us assume we are 
simulating a system of six particles in the canonical en- 
semble, and that these particles are found as either a hex- 
amer or six isolated monomers, with the ratio of hexamer 
states to monomer states given by $ = exp(20 — 0.4T), 
with T as the temperature - this model then has simple 
two-state behaviour. Note that this model is coopera- 
tive in the sense that the formation of a single hexamer 
is a cooperative phenomenon, requiring all 6 particles 
to be present, rather than that the presence of hexam- 
ers favours the formation of other hexamers. The yield 
curve of this small system as a function of temperature 
is plotted in Figure HJ 
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FIG. 1. Illustration of the neglected concentration fluctuations 
that lead to finite size effects in canonical simulations, (a) Two 
examples of states sampled during the assembly of a single tctramcr 
from two distinct monomer types. Now consider the assembly of 
two tetramers in twice the volume. For the purposes of analysis, 
separate this this larger cell into two halves, as in (b), (c) and (d). 
Some configurations of the system will have two particles of each 
type on either side of the partition, as in (b). These configurations 
will give the same average concentration of clusters as the original 
system of a single tetramer, as the available states in each half of 
the system are equivalent to the available states in (a). By contrast, 
configurations such as those in (c) and (d), in which the particles 
are unevenly distributed either side of the partition, will have a 
different average concentrations of clusters. This difference arises 
because the available states on either side of the partition are not 
captured by the single-target system. 



We can then ask the question: what happens if we sim- 
ulate a much larger number of monomers, in a volume 
such that the average density is the same as in the small 
simulation? The result actually depends on whether the 
hexamer consists of six identical particles or contains a 
number of distinct species. In the first case, which was 
treated in Reference the bulk yield of the model can 
be inferred assuming separate clusters behave ideally, and 
results arc plotted in Figure [2j An outline of this calcu- 
lation is also presented in Section Til Dl It is clear that, 
for this toy model, the bulk yield is very different from 
the small-system yield, the biggest difference being that 
the transition is far wider in bulk than in a single-target 
system. In other words, conditions that generate high or 
low yields of clusters in a single-target system tend to 




0.05 0.1 0.15 0.2 0.25 0.3 
T 

FIG. 2. Melting transition for a cooperative toy model of hex- 
amer formation from six identical monomers. The dashed curve is 
the fractional yield as a function of temperature for a single hex- 
amer formed from six particles in the canonical ensemble. This 
curve represents a toy two-state model with a ratio of hexamer to 
monomer states given by <E> = exp(20 — OAT). The solid curve is 
the same transition extrapolated to the bulk limit. 



generate less extreme yields in the bulk limit. Clearly, 
if one wishes to compare simulation data of a model to 
bulk experiment, it is important to estimate bulk yields of 
the model correctly before doing so. We note that if the 
assumption of ideality of separate clusters is not reason- 
able, the exact form of the deviation between small- and 
large-system yields will vary from that presented here: 
nonetheless, the cause of the discrepancy will persist and 
differences will remain large. 

In Reference l63l . we used a statistical mechanical ap- 
proach to derive the correct extrapolation procedure 
from small simulations of clusters formed from identical 
monomers, and dimers formed by distinct particles. The 
convergence on bulk yields as simulation size increased 
was also studied, enabling the approximations inherent 
in the extrapolation to be checked for any particular sys- 
tem. In this work we extend the methodology to include 
arbitrary size clusters with any number of particle types, 
including cases in which one of the constituent monomers 
is immobilized. 

Grand canonical simulations, which in principle incor- 
porate these local concentration fluctuations, also give 
misleading results if only a single large cluster is sam- 
pled. The size of these errors and a methodology for 
correcting them are also presented here, and compared 
to an alternative approach common in the literature. 

The techniques for extrapolating from small simula- 
tions presented in this work can be seen as methodologies 
for estimating free energies of assembly for a model sys- 
tem, from which equilibrium concentrations follow. This 
work emphasizes the substantial errors that can arise 
when simulation yields are not properly analysed, derives 
a methodology for performing this analysis and presents 
an approach for examining the accuracy of the assump- 
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tion of ideality underlying the theory for a given system. 
Additionally, we show how simulation yields converge on 
bulk values as the system size is increased, allowing an 
estimate of how large a simulation must be before finite- 
size effects are negligible. 

We emphasize that the methods discussed here will 
not necessarily make any given model agree better with 
experimental data: rather, they allow an accurate and 
physically meaningful comparison with experiment to be 
made. Failure to accurately account for finite size effects 
when matching a simulation of a small system to bulk 
experiments or thermodynamic theories is analogous to 
using a flawed algorithm to obtain the data. The re- 
sults obtained are not a true representation of the bulk 
behaviour of the model being simulated. 

The rest of this paper is structured as follows. Firstly, 
the assumptions and formalism relevant to this study are 
introduced in Section [I Al Next, finite size corrections for 
arbitrarily sized clusters of multiple particle species are 
derived and the convergence on bulk yields analysed in 
Section [TTJ The case of systems in which one of the re- 
actants is localized within a certain volume is then con- 
sidered in Section III Gl Subsequently, issues arising in 
grand-canonical simulations are discussed in Section IIII1 
and finally the inference of bulk properties other than 
cluster yields is outlined in Section ITVl 



A. Methodology, assumptions and definitions 

This paper uses a statistical mechanical approach to 
highlight differences between simulations of assembly of 
a single target cluster and simulations in which multiple 
targets can form, both in the canonical and grand canon- 
ical ensembles, and introduces an approach for relating 
them. The terminology and assumptions of this analy- 
sis are introduced here. To demonstrate the statistical 
finite-size corrections, several model systems are simu- 
lated in this work - as the details of each simulation are 
different, and because the simulations are only for illus- 
trative purposes, the methodology is briefly described at 
the relevant point in the text and covered in detail in the 
supplementary material^ 

We consider simulations in a small periodic cell of vol- 
ume v of interacting objects which are generically referred 
to as particles. We assume that the particles have some 
tendency to aggregate into clusters of finite size (rather 
than undergo a thermodynamic phase transition). It is 
assumed that, for the system in question, we have some 
way of defining which particles are in a cluster. The de- 
tails of such a definition are not important - we only 
require that clustering involves particles being in close 
proximity, and strongly bound particles will be part of 
the same cluster. It is also assumed that, in a bulk system 
at the appropriate concentration, clusters behave ideally 
(i.e., interactions between clusters are negligible except 
when they form larger clusters). We define a macrostate 
to consist of all states with a given set of clusters, whether 



in a small or large system. We now define some quanti- 
ties that will be of use in the analysis. The notation is 
more complex than in Reference l63l but it is also more 
powerful, allowing systems with an arbitrary number of 
particle species to be analysed. 

• D is an (arbitrary) scale factor relating a thermo- 
dynamically large volume to a simulation volume 
v. 

• y is the number of distinct monomer species 
present. 

• {*} = *2> •••> iy) defines a cluster containing ij 
particles of type j. On several occasions, it will be 
necessary to take a sum or product over all clusters. 
In these cases, {m} = (mi, m^, m y ) will be used 
as a set of dummy variables. For cluster formation 
from a single species, {i} reduces to a single integer 
(i). 

• rju\ is the number of times that the cluster {i} ap- 
pears in a macrostate. The macrostate is therefore 
completely specified by the set {rj}. Let rjj be an 
abbreviation for the number of isolated monomers 
of particle- type j. 

• zu\ is the partition function for cluster {i} in the 
simulation volume v, with the internal degrees of 
freedom treated distinguishably. The need to use 
expressions involving both distinguishable and in- 
distinguishable statistics arises in this work be- 
cause the enumeration of configurations is most 
easily done by treating particles distinguishably, 
and then accounting for indistinguishability after- 
wards. Furthermore, computer simulations natu- 
rally treat objects as distinguishable, which is par- 
ticularly important in grand canonical simulations. 
In this work, partition functions calculated with 
distinguishable statistics will be symbolized by z 
or Z, and indistinguishable partition functions rep- 
resented by q or Q. 

• Z^ v y is the partition function of a system of volume 
v when in a macrostate {rj}. This partition function 
is calculated using distinguishable statistics. 

• qu\ is the partition function for a cluster {i} in a 
volume Dv, with the internal degrees of freedom 
treated indistinguishably. qj is an abbreviation for 
the partition function of a monomer of particle- type 
j in volume Dv. 

• [{i}] is the concentration of cluster {i}, and [j] is 
the concentration of a monomer of particle type j. 
Here it is most convenient to use particles per unit 
volume as the measure of concentration. 

• fiu\ is the chemical potential of cluster {i}. Let fj,j 
be an abbreviation for the chemical potential of an 
isolated monomer of particle- type j. 
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II. SYSTEMS IN THE CANONICAL ENSEMBLE 
A. Multi-species assembly 

A number of authors have taken cluster yields in 
canonical simulations of single-target assembly as di- 
rectly applicable to bulk systems ^^^—^ These ex- 
amples involve dimers and clusters of a single species, 
which we analysed in our earlier work^ Interesting 
structures, however, are not exclusively dimers or formed 
from identical subunits. DNA nanostructures, such 
as polyhedra ; 65 ' 66 often involve several different single 
strands. Virus capsids can require more than one type of 
coat protein, and some work has been undertaken to sim- 
ulate models of such structures^ Simulators have also 
considered tcmplatcd assembly, in which distinct shells of 
particles form cooperatively^ Here we extend our pre- 
vious work to monodisperse assemblies of an arbitrary 
number of different species, so that quantitative analyses 
of such systems can be performed using data from small 
simulations. For small simulations of monodisperse as- 
sembly in the canonical ensemble, it is natural to use 
exactly enough particles to form a single target. Our 
discussion and examples will assume this is the case, al- 
though the results do not depend on it. 



B. Bulk equilibrium yields of self-assembly 

In this section we derive expressions for bulk cluster 
yields in terms of the quantities defined in Section II Al 
that will later be used to analyse small simulations. The 
results can be slightly simplified for assemblies of a sin- 
gle species: these simplifications are highlighted in the 
text. A standard result of equilibrium statistical me- 
chanics of ideal particles is that the chemical potential 
/irjj of cluster-type {i} in a volume Dv is given by 



Hi} 



-k B Tln 



(1) 



where the approximation becomes an equality in the 
thermodynamic limits In this limit, equilibrium ther- 
modynamics gives ^{ijHi} = f° r an Y possible 
reaction^ where uu\ are the stoichiometric coefficients 
of the species in the process. For the special case of a 
cluster {i} forming from its constituent monomers, the 
relation between the [i^y reduces to J^ . ijfij = fi^y 
Combining this result with Equation [TJ we obtain 



iU'/; iii'/; 



(2) 



for the equilibrium between a cluster {i} and its con- 
stituent monomers. We convert to concentration using 
the system volume Dv, giving 



(Dv) 



t -i 9{i} 
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~V W , (3) 



where itot = ^2 x i-x, and we have defined ip^y for later 
convenience. In the case of cluster formation from only 
one species of particle, the products in the denomina- 
tors contain only one term. The quantities v ttot ipu\ 
are model properties (related to binding strengths), and 
are independent of our small simulation volume v: ip{i}, 
which can be extracted from small simulations, have a v 
dependence that cancels with the explicit w 1 * *" 1 . 

If v Itot_1 -0{- i } arc known, it is possible to extract the 
equilibrium yields of all clusters by solving the simulta- 
neous equations given by Equation[3j(one for each cluster 
of more than one particle) and the conservation of total 
particle number, 



{m} 



Aim}] 



Ft, 



(4) 



where [x]t is the total concentration of particles of type 
x. One conservation equation is obtained for each dis- 
tinct species of particle. This result holds for any set of 
initial concentrations [x]t- Further, it is even possible 
to infer bulk yields in the isothermal-isobaric ensemble, 
provided our assumptions of ideality remain valid. In 
the isothermal-isobaric system at pressure p, the total 
volume is not fixed. Equations [3J and [4] still hold due to 
the equivalence of ensembles in the thermodynamic limit, 
and they are solved along with 



p/k B T = J2i{™}} 



(5) 



{m} 



which follows from the equation of state of ideal gases 
and allows the total volume as well as the concentrations 
to be determined. Note that the temperature T at which 
results arc inferred must be the same as that used to de- 
termine u ltot-1 ?/){ i }, which will generally be T-dcpendcnt 
in a non-trivial manner. In cither ensemble, the problem 
of obtaining bulk yields reduces to obtaining the quan- 
tities v lt ° t ~ 1 i}}^ and then solving a set of simultaneous 
equations. We note that, as v 1 * * -1 ^.^} are constants for 
a given T, Equation [3] is a classic 'law of mass action' as 
expected for a simple assembling system^ 



C. Appropriate ensembles and free energies 

Experiments are often performed under conditions of 
approximately constant particle number, temperature 
and pressure. This would suggest that the use of the 
isothermal-isobaric ensemble is appropriate, and indeed 
this is true for assembly processes studied in the gas 
phase, such as in Reference |62|. In this case, it would 
be more natural to convert the concentrations into par- 
tial pressures p x = k-g,T[x\. If a standard pressure p~®~ is 
introduced, we can convert Equation [3] into 



vp 



kT 



Hot — 1 



(6) 
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where is the partial pressure of cluster {i}. This is 

a well-known result (the law of mass action for an ideal 

system) and allows us to define a temperature-dependent 

dimcnsionlcss equilibrium constant K~®~(T), with an as- 

-e-. 
W 



sociated free energy change of formation AG" " 



vp 



-0- \ 'tot —1 



kT 



K^(T) 



exp(-AGf } /RT). 



(7) 

Measuring the quantities v ltat V'fi} is therefore equiva- 
lent to finding the standard free energy change of forma- 
tion at a given temperature. 

In many cases of experimental interest, however, the 
assembling particles are not isolated: other species are 
present, contributing to the total pressure of the system. 
If the interaction of these extra species with the self- 
assembling species is negligible, and their partial pres- 
sure is known, Equation [5] can simply be modified to 
(p — p')/ksT = X){ m }[{ m }]> m w hich p' is the partial 
pressure of the non-reactant species. 

Many examples of self-assembly, including the ma- 
jority of soft matter systems mentioned in Section 
HI occur in dilute solution. A dilute solution is a 
case in which the partial pressure of the solvent dom- 
inates that of the self-assembling particles, and in 
general the interactions of the solvent with the self- 
assembling particles are no n- negligible. Many of the 
mesoscopic models discussed in Section U treat the sol- 
vent implicitly^-^.^^^-ii^-i^^^ With the solvent 
treated implicitly, model clusters (at low enough concen- 
trations) will behave ideally except for when they bind to 
form a larger cluster. A self-consistent methodology for 
comparing these models to experiment would be to as- 
sume that clustering causes no change to the pressure of 
the system: i.e., the partial pressure of solute is negligi- 
ble and that the change of the solvent / solute interaction 
due to clustering has no effect on pressure. In this case, 
clustering does not influence the system volume and so 
the cluster yields in bulk are the same for the canoni- 
cal and isothermal-isobaric ensemble. Yields can then be 
inferred using a fixed volume, requiring only Equations 
[3] and 31 For dilute solutions, it is common to use con- 
centrations rather than partial pressures: introducing a 
standard concentration [c] "®" , the standard free energy 
change of formation AG^ follows from Equation [3] as 



=K^= exp(-AGg/i?r). (8) 



Note that although relative volume changes due to clus- 
ter formation in real systems may be small, meaning that 
the assumption of a constant total volume is reasonable, 
the pV (pressure-volume) contribution to the Gibbs free 
energy of cluster formation may not be negligible. In 
this case, mesoscale models with implicit solvents that 
are compared to experimental data would still neglect 
any volume change, but would incorporate the pV con- 
tribution to assembly implicitly as part of the effective 
interaction between particles. 



Some approaches, including fully atomistic representa- 
tions, explicitly model solvent particles^ -28 ' 31 ' 33 ' 37 Sim- 
ulations of such models can be analysed in terms of the 
solute clustering, treating the solvent implicitly at the 
level of the analysis rather than in the actual model. 
Single-target simulations performed at constant volume 
will neglect any pV contributions to assembly inherent 
in the model - with this caveat, bulk yields can be esti- 
mated through the methodology presented in this work. 
The cluster yields of single-target simulations of explicit 
solvent models in isobaric ensembles will also require sta- 
tistical finite-size corrections. If the variation in volume 
is negligible compared to the overall volume, then the 
methodology presented in this work can be used to esti- 
mate bulk yields by taking the average volume in simu- 
lations as the small-system volume. 

For the remainder of this work, the analysis will be 
presented in terms of dilute solutions (as this is most 
relevant to our work), and hence the bulk yield in the 
canonical ensemble is the appropriate quantity for com- 
parison with experiment. Nonetheless, obtaining al- 
lows the calculation of isobaric yields if desired. To avoid 
the complication of multiple (ufc]"® - ) 1 ' * -1 factors, it is 
simpler to analyse the problem in terms of the quantities 
q^ij and ipu\ and convert to molar concentrations after- 
wards. The majority of this work is focused on correctly 
estimating ip^y from single-target simulations. 



D. Inferring bulk yields from small canonical simulations 

To extract tjJu\ from small canonical simulations, it is 
helpful to calculate the contribution to the small-system 
partition function of a macrostate with rij particles of 
type j, arranged into the set of clusters {rj}. Under our 
assumptions, this is 



Z 



{'/} 



n 

{m} 



J1{m}) } - (nlmj) 



(9) 



This expression is obtained by multiplying the individual 
distinguishable partition functions Zi m \ together, then 
considering all possible permutations of identical par- 
ticles which change the clustering. Dividing by YVj n r 
would make the statistics indistinguishable. For a single 
constituent species, the products over x and j contain 
only one term. The yield of a certain cluster {i} in the 
small simulation, u[{i}](i), follows from and is given 
by 



«[{<}] 



(i) 



T,{r,}V{i} Z {y} 



(10) 



Here the sum runs over all possible macrostates {77}: the 
denominator is then the entire partition function of the 
n-particle system. The subscript in u[{i}]m indicates 
that we are considering a single-target system. Multiply- 
ing the concentration by the original simulation volume 
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v means that yields reported are the average numbers of 
different clusters in a volume v, a convenient dimension- 
less quantity. 

One can relate the q^y to zu\ by incorporating the 
relative scale factor D of the volumes in which they are 
defined, and accounting for the over-counting of indistin- 
guishable states within zu\ . We obtain 

g» = Z {i} 

In the case of cluster formation from a single type of par- 
ticle, the product over x contains only one term. Com- 
bining Equations [3J [9] [10] and [TT] then yields 



(11) 



«[{*}](!) = 



z^w 11 - 
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En 

{r;} {m} 



m}J 



(12) 



liwJ 



We have therefore expressed the small system yield as a 
function of the ratios ij}s m \, or equivalently tpu\ (as m 
and i are just labels), which determine the bulk yield. 

can therefore be extracted from a small simulation 
by fitting the observed yields u[{i}](i) to Equation [T2l 
and the bulk yields obtained as discussed in Section III Bl 
For the case of homoclusters (clusters consisting of one 
species of particle), an alternative method that does not 
require fitting and automatically decouples the simulta- 
neous equations is possible, as outlined in Reference l63l . 

It is instructive to reconsider the toy model of coop- 
erative hexamer formation introduced in Section fl] The 
results of Figure [5] follow directly from Equations [3J 0] 
and 1121 The ratio of hexamer to monomer states in a 
small simulation is given by $ = exp(20 — OAT), with T 
as the temperature. 

. «[(6)] (1) = */(l + *), «[(1)] (1) = 6/(1 + *). 

• 0(6) = */6! is found by substituting the single- 
target yields of the toy model into Equation [12] 
(-0(1) = 1 by definition). Note that in this case 
Equation Q2] can be solved for ipux (rather than re- 
quiring fitting) due to the simplicity of the system, 
but this is not generally the case. 

• The bulk fraction of hexamers, / = v[(6)], can then 
be shown to obey 6!/ = 6 6 $(1 — /) 6 using ip/g\ and 
Equations [3J and 2] 

• This equation can be solved numerically to give a 
bulk fractional yield of hexamers / = v[(6)] that 
can be compared to the single-target yield u[(6)](i). 

Alternatively, we can imagine a different system in 
which the hexamer consists of three particles each of two 
different species. Let us again assume that in single- 
target simulations the system follows a two-state model 
with a ratio of hexamer to monomer states given by 
$ = cxp(20 - OAT). 



., Single hexamer 



Bulk hexamer 
(single species) 




FIG. 3. Completely cooperative transition for a hexamer formed 
from six monomers. The blue curve is the yield as a function of 
temperature for a single hexamer in the canonical ensemble follow- 
ing a two-state model with a ratio of hexamer to monomer states 
given by $ = exp(20— 0.4T). The green curve is the same transition 
extrapolated to the bulk limit (with the same total concentration 
of particles) in the case where the hexamer is formed from iden- 
tical particles, and the red curve is the extrapolation to bulk for 
the same small system result when the hexamer is formed from two 
different species, contributing three particles each. 



«[(3,3)] ( i) 
3/(1 + *). 



*/(! + *), w[(l,0)] (1) =v{(0A)} {1) 



^(3,3) = */ (3!) 2 is found by substituting the single- 
target yields of the toy model into Equation Q2] 
(0(i,o) = "0(0,1) — 1 by definition). Once again, 
Equation [T^] can be solved for ipu\ in this simple 
case, rather than requiring a fit. 

The bulk fraction of hexamers, / = i>[(3,3)], can 



then be shown to obey (3!) 2 / = 3 6 $(1 
0(3,3) an d Equations [3] and SJ 



/) using 



• This equation can be solved numerically to give 
a bulk fractional yield of hexamers / = v[(3, 3)] 
that can be compared to the single-target yield 
w[(3,3)] (1) . 

The extrapolations for these two different systems 
(with the same single-target yield) are plotted in Figure 
[3J along with the single-target yield. As with clusters 
of one particle type, the bulk transition is far broader 
than in the single-target case. This widening effect can 
be understood by considering the effect of concentration 
fluctuations, as illustrated in Figure [T] If a system of 
twice the size is considered, fluctuations in concentra- 
tion like that in Figure[5](c) can occur. Such fluctuations 
strongly favour the formation of exactly one target struc- 
ture rather than zero or two, as it is impossible to form 
one on the left of the box and on the right hand side, 
the extra particle makes the formation of a cluster much 
more statistically favourable. Increasing the system size 
and allowing concentration fluctuations within cells of 
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volume v therefore tends to give yields that are less domi- 
nated by one particular cluster size than the single-target 
system. The result is a much broader transition in bulk. 

It is also possible to understand why the heterocluster 
yield is lower in bulk than for homoclusters with the same 
single-target yield. For heteroclusters in bulk, we have 
fluctuations of concentration in a volume v not only of 
the total particle number, but also of the relative num- 
ber of each type of particle, as shown in Figure [2](d). 
These fluctuations always disfavour the formation of tar- 
get clusters, resulting in a lower yield in bulk for the same 
single-target yield. 

Although we do not claim that the methodology pre- 
sented in this work will make a given model agree better 
with an experiment, it is worth noting that single-target 
yields do not obey the law of mass action as would be ex- 
pected for simple models, unlike bulk yields. As a simple 
example, consider a dimer-forming system of two distinct 
particles. The law of mass action, as embodied by Equa- 
tion H predicts that [(1,1)] oc [(1, 0)][(0, 1)]. For a sto- 
ichiometric solution, this equation reduces to [(1,1)] oc 
[(1,0)] 2 . In a single-target system, [(1, 1)] (1) /[(1, 0)] (1) oc 
1/v as doubling the volume with the same number of 
particles will halve the ratio of bound to unbound states. 
However, 1/v = [(1,0)]t, the total concentration of par- 
ticle of type 1. Therefore, [(l,l)] ( i) oc [(1, 0)] (1) [(1, 0)] T , 
which is a fundamentally different result from the law of 
mass action. 



E. Convergence on bulk yields 

It is instructive to consider how yields converge on their 
bulk values as system size is increased (whilst maintaing 
the same total concentration of particles). Firstly, this 
gives an idea of how large simulations must be to reflect 
the thermodynamic limit. Secondly, it provides a tool 
to check the validity of the extrapolation in Section III Dl 
if it is possible to simulate the formation of two targets 
in twice the volume, the change in yield from the first 
simulation can be compared to the predictions of this 
section to ensure that the assumptions underlying the 
theory are accurate. 

Consider a simulation of a system of size d with the 
same total concentration as the relevant single-target sys- 
tem, where d is not necessarily thermodynamically large. 
We can extend the concepts of the previous section, in 
which an expression for the yield of clusters for d = 1 
was found in terms of ip^y, in a very simple fashion to 
give 
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FIG. 4. Convergence on bulk yields for cooperative hexamer 
formation as a function of system size d. (a) low hexamer yield 
(5% in bulk), (b) high hexamer yield (95 % in bulk). The solid 
curve depicts convergence for hexamers consisting of two distinct 
species, each contributing three particles to the hexamer (in this 
case V(3, 3 ) = 9.33 X 10~ 5 and 8.34 X 10 4 for the 5 % and 95 % cases 
respectively). The dashed curve shows the result for hexamers 
containing six identical monomers (with V(6) = 1-30 X 10 3 and 
1.46 X 10 -6 for the 5% and 95 % cases respectively ). 



volume. An alternative useful quantity is the fraction of 
particles of type a that are found in clusters of type {i} 
as a function of simulation size d, fuyu) = 7f~ w [W](d); 
where n a is the number of particles of type a in the single- 
target simulation. For a given set of , one can explic- 
itly calculate /{j}^) ana - observe its convergence on bulk 
values. In all systems we have studied, f^y^ — /fi}(oo) 
scales as 1 jd at sufficiently large d, although convergence 
at low d can be more complex. 

To make more concrete statements, consider the com- 
pletely cooperative toy model of hexamer formation in- 
troduced in Section |TJ For a given bulk yield, ^(3,3) (or 
in the single-species case) can be inferred from Equa- 
tions [3] and 21 and then substituted into Equation [T3] to 
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give the fractional yield as a function of system size. Even 
in this simple case, two qualitatively distinct regimes of 
convergence are observed, at high and low yield of the 
target structure, as shown in Figure 2] At low yield, con- 
vergence is monotonic and quickly settles down to the 
l/d form. By contrast, convergence at high yield is ini- 
tially slow, before a more rapid decay towards the bulk 
value. In some cases, converge can involve oscillations 
before the l/d regime is reached. 

These results are qualitatively comparable to equiva- 
lent size homoclusters. It is noticeable that, for the same 
high bulk yield of target clusters, heterocluster conver- 
gence is slower and has less pronounced oscillations than 
for homoclusters. Convergence is slower because, in or- 
der to generate the same high target yield in the infinite 
limit, the single-target simulation must have a higher ra- 
tio of target clusters to monomers ($) for heteroclusters 
than homoclusters (as can be seen in Figure and was 
discussed in Section III Bp , and convergence to the bulk 
value is then slower. By contrast, heterocluster conver- 
gence is slightly better at low target yields, as this time 
the need to have a higher $ to obtain the same bulk yield 
reduces the error. 

The oscillations at high yield for homoclusters coin- 
cide with the points at which macrostates with a certain 
number of target clusters come to dominate the ensemble. 
Initially, the macrostate with d target clusters (all parti- 
cles are found in clusters of the largest size) is dominant. 
Eventually, as system size is increased, the macrostate 
with d — 1 target clusters becomes dominant due to the 
cntropic cost of having no monomers. However, the d— 1 
macrostate becomes dominant before (d— l)/d = ffe^j 
(the fractional yield of target clusters in the bulk limit), 
and consequently /{ t i( d ) < f{t}(ooY ^ s ^ increases fur- 
ther, the d — 1 macrostate remains dominant but now 
(d - l)/d > /f n(oo) , and so ff t}(d) > /f n(oo) . Smaller 
oscillations are then repeated as macrostates with d — 2, 
d— 3 etc. targets successively become dominant. Eventu- 
ally the oscillations are overwhelmed by the overall l/d 
convergence. The suppression of oscillations in hetero- 
clusters can be understood in terms of their slower con- 
vergence - as the configurations with more monomers 
take longer to become dominant for a given bulk yield, 
the tendency to underestimate the bulk yield is sup- 
pressed. 

Of course, real systems are not perfectly cooperative, 
and the finite concentration of intermediate clusters has 
consequences for the convergence properties. Again, we 
cannot claim to have tested all possibilities but the ef- 
fect of intermediate cluster sizes appears to be similar to 
the effect in homoclusters^ The consequences for con- 
vergence are most pronounced when the prevalent inter- 
mediate clusters are close in size to the majority cluster, 
when convergence is generally slower than if the interme- 
diates are absent. 




FIG. 5. Snapshot from a simulation of a trimer-forming DNA sys- 
tem, using the DNA model of Reference 68. Backbones of strands 
of type 1 are coloured red, 2 are blue and 3 are green; all bases are 
coloured sky blue. The cluster nearest to the centre is typical of a 
three-armed trimer, an isolated stand is shown on the far right and 
a two-strand intermediate is shown in the top left. 
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TABLE I. Yields of DNA clusters in a trimer forming system. 
Clusters {i} are defined by the number of each strand type that 
they contain: (ii,«2,«3) contains ij strands of type j. The defi- 
nition of what constitutes a cluster is given in the supplementary 
material— Yields of each cluster are shown for simulations of a 
single cluster (v [{i}] nT), an< l bulk results are extrapolated from 
these data using the methodology discussed in the text ("[{«}] prcd )- 
Yields from simulations of two clusters (v [j]^™) are compared to 

the yield , which is predicted from using Equa- 

tion rrj 



F. Example extrapolation 

To demonstrate the use of the extrapolation technique 
on a model system, we consider the formation of three- 
armed DNA trimers from distinct strands, using the 
coarse-grained DNA model of Reference [H. This model 
treats DNA as a string of rigid nucleotides with effective 
interactions to model chain connectivity, excluded vol- 
ume, hydrogen bonding and base stacking. In this work 
we are not really concerned with how good an approxima- 
tion the model is to reality. We are simply demonstrating 
that the extrapolation procedure can be applied to real 
simulation results, giving bulk statistics that could then, 
if desired, be sensibly compared to experiment. 

We consider three distinct strands of DNA, the 
sequences of which are given in the supplementary 
material^ The three strands tend to form three-armed 
junctions, as each strand has two 6-base sections, each of 
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which is complementary to a 6-base section on a different 
strand. An example of the three-armed junction is given 
in Figure [5j which also shows a two-strand intermediate 
and an isolated strand. 

We simulated one strand of each type in a periodic 
cell, measuring the resultant distribution of clusters. De- 
tails of the simulations are provided in the supplementary 
material^ The yields of various clusters resulting from 
these simulations are tabulated in Table |TJ Also shown 
are the yields predicted for bulk by extracting ipux from 
fitting Equation [T^] to the data and then solving Equa- 
tions [3] and 2] for [{«}]. As is evident, the bulk yields 
are significantly different from those in the single-target 
simulations: specifically, the high yield of trimers in the 
single-target simulation is reduced, and the lower yields 
of single strands and two-strand complexes are increased 
in bulk. This is as expected from the general broaden- 
ing of the transition that was discussed in Section III Dt 
as the dominant cluster in a single-target simulation be- 
comes less dominant in bulk. 

This extrapolation assumes that separate clusters be- 
have ideally. As the DNA model used here has only short- 
ranged interactions, and the system is fairly dilute, this 
seems a reasonable assumption. We can perform a more 
rigorous test, however, in that we have used the same 
methodology to predict not only the yield in the infinite 
system-size limit, but also how the yield changes as the 
system size is increased. The expected yield in a two- 
target simulation with the same total density, inferred 
using tp^ and Equation[T3l is given in Table HI Although 
more challenging than the single-target simulation, it is 
also possible to simulate the simultaneous formation of 
two trimers in twice the volume using a high-dimensional 
reaction coordinate for umbrella sampling: additional in- 
formation on these simulations is given in the supple- 
mentary material^ The resultant cluster yields are also 
shown in Table |U 

As is evident from Table HI the predicted and measured 
yields in a two-target simulation are in excellent agree- 
ment. This strongly suggests that the assumptions of the 
extrapolation procedure (such as ideality) are reasonable 
for this model under these conditions, and therefore that 
the bulk values reported in Table U are representative of 
the yields that would follow from a macroscopically large 
simulation. Furthermore, it provides a 'sanity check' of 
the accuracy of the approach presented in Sections IIIB1 

irroiandirrEi 



G. Theory of localising a single reactant species 

In some experimental systems, the particles that asso- 
ciate are not all free to diffuse. For example, DNA mi- 
croarray assays consist of DNA 'probes' which are teth- 
ered to a surface, and 'target' molecules which diffuse 
through solution^^ With the advent of DNA origami, 
experimentalists are now able to localize isolated reac- 
tants at will^ Figure |5] illustrates such a localisation 



for a generic system. Several groups have simulated the 
binding of DNA to a tethered strand, extracting quan- 
titative estimates of melting temperatures without ap- 
plying finite size corrections £2r£§. We note that practical 
DNA microarrays typically have such a high density of 
strands tethered to the surface that clusters are unlikely 
to behave independently and ideally^ the simulations 
in References I5a - t58l however, considered isolated teth- 
ered strands and hence can only be sensibly compared 
to much sparser systems. To perform this comparison, 
it is necessary to consider whether corrections must be 
applied to yields from single-target simulations. 

It is not a priori obvious whether tethering one reac- 
tant will change our earlier results, for which local con- 
centration fluctuations were invoked to explain the differ- 
ence between bulk and small-system statistics. Note that 
here we are not concerned with whether the mechanism of 
tethering interacts with the particles, either destabilizing 
or stabilizing the bound state. For example, the presence 
of a surface to which a particle is attached could be either 
attractive or repulsive for the non-localized particles. In- 
stead, we are concerned with whether extrapolation to 
bulk for a given set of differs from Section HlDl 

To analyse this problem, it is instructive to consider 
how the standard result of J^. Villi = 0, with fii given 
by Equation [1] and Vi being stoichiometric coefficients 
in a reaction, arises directly from the partition function. 
The contribution to the partition function (calculated us- 
ing indistinguishable statistics) of a large system with a 
macrostate which has clusters of type {i} (with all 
clusters behaving ideally) is given by 



Q {n} (D) = l 



{<} 



V{i} 



n 



(14) 



This expression contains a product over all the partition 
functions of the individual clusters, divided by an ?7{i}! to 
avoid double counting of states, which must be included 
because each qin includes a separate integral for each 
cluster over the whole of the system volume. Maximizing 
Q{ v } with respect to {rj} yields the standard result. 

We now consider a system in which one of the reac- 
tants is immobilized (let this be particle type 1, and let 
us further assume that only one particle of type 1 can be 
involved in any given assembly) . The first consequence is 
that there is no need to divide by ?7{i}! when the cluster 
{i} includes the immobilized species, as there is no ten- 
dency to count indistinguishable states twice when the 
clusters cannot move over all space. One must, how- 
ever, still deal with combinatorial effects. In particular, 
we now have to calculate the combinatorial factor associ- 
ated with the number of different choices of immobilized 
particles that are involved in cluster formation. This in- 
troduces a factor 



ni< 



(15) 



in which n\ is the total number of localized particles in 
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FIG. 6. Schematic depiction of a localized species. The green 
particles are localized near the centre of the cells, and hence their 
concentration docs not fluctuate. 



the system. The second effect is that q^q does not scale 
with system volume for i\ ^ 0, so that 
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for i\ / 0. 



(16) 



Including both alterations, we obtain the following par- 
tition function of a system with immobilized particles in 
the macrostate {N}, 
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In terms of Zu\, Equation [T7] becomes 
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It is trivial to check that for two sets of clusters {77} 
and {r/}, Q^i/Qf™'} nas exactly the same functional de- 
pendence on Zjj}, 77^1. and rj'^x as Q{ v }/Q{ v '}- Therefore 
there are no statistical consequences of tethering one of 
the reactants - a given set of which means a given 
set of yields in a single-target simulation, will extrapo- 
late to bulk yields that are identical to the case in which 
all species are free to diffuse. In other words, the proce- 
dure of scaling single-target results to bulk is unchanged, 
although the single-target results themselves may be in- 
fluenced by the localisation mechanism. We also note 
that at no stage have we used the fact that D is thermo- 
dynamically large in this argument, so it applies just as 
well to systems of intermediate size. Therefore it is not 



only the scaling to the bulk limit that is unchanged by 
tethering, but also the form of the convergence on the 
bulk limit as system size is increased. 

Practically, this means that Equations [3] and 0] can 
be directly used to calculate the expected bulk concen- 
trations from any initial set of reactant concentrations, 
having used Equation [12] to extract tp{i} from a single- 
target simulation in a small volume. Note, however, that 
although the yields of clusters are expressed as concentra- 
tions, tethered clusters will not be uniformly distributed 
throughout the system. Similarly, the convergence of 
yields as the system size is increased at a constant total 
density can be followed using Equation 1131 An example 
of such an extrapolation is provided in Section III HI 

Physically, the result is identical to the unlocalized case 
because in this idealized limit the only important coordi- 
nates are the relative separations of cluster-forming par- 
ticles. As a consequence, it is irrelevant that particles 
of type 1 are tethered, as the concentration fluctuations 
of the other particles in the vicinity of type 1 provide 
the same statistical correction as the untethered case. 
Such an argument does not hold if two particles that 
are involved in an assembly are localized. For a trivial 
counter-example, one could take heterodimer formation. 
Localizing both species will give thermodynamics identi- 
cal to the small system limit. 

When performing simulations of this kind, it is pos- 
sible that non-tethered particles will interact with the 
tethering mechanism in the unbound state. If this has 
a significant effect on the unbound partition function, it 
will lead to errors in the extrapolation. Changing the 
simulation volume and observing whether the statistics 
of bound states change in the expected way can check for 
such effects. 



H. Example extrapolation for a localized species 

As a demonstration of extrapolation with a localized 
particle, we consider the formation of a six-base-pair 
DNA duplex using the model of Reference [68l One of 
the strands in this duplex has a three-base tail, which 
is permanently attached to a repulsive surface by its 5' 
end (DNA strands are directional: the two ends are la- 
belled 3' and 5'). The other strand is free to diffuse. We 
performed simulations of a single-target system, and of 
a system of twice the volume and number of strands, us- 
ing umbrella sampling to enhance equilibration. Figure 
[7] shows typical bound and unbound states from a single- 
target simulation, highlighting the tethering of the longer 
strand to a surface. Further details of the simulations are 
provided in the supplementary material.— 

The yield of duplexes in a simulation volume, 
= 0.764(8), was obtained in the single-target 
simulations. This implies a ratio of ■Z{i.i}/( z {i,o} 2: {o.i}) = 
^{1,1} = 3.24(14), from which we infer an expected two- 
target yield of u[{0]( 2 r ) = 0.667(8) using Equation Q2] 
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(a) 



(b) 



FIG. 7. Two states from a simulation of DNA binding to a tethered 
strand. In both cases the longer red strand is attached to the 
surface by the 5' end. The parallel lines represent the excluded 
volume of the surface: the centre of any backbone site is forbidden 
from entering this region. In (a), the two strands are bound; in 
(b), the shorter blue strand is detached and free to diffuse. 



and a bulk fraction of duplexes w[{i}] pred = 0.578(7) us- 
ing Equations [3] and |U Our measured two-target yield 
of dimers, ^[{i}]^™ = 0.651(7), does not show a statisti- 
cally significant difference from the prediction, suggesting 
that the bulk value is also reliable and that non-ideal ef- 
fects are not detectable at this precision. As expected, 
the yield of the dominant cluster size (the duplex) is re- 
duced due to the extrapolation, consistent with the gen- 
eral broadening effect of concentration fluctuations on 
transitions. 



III. SIMULATIONS IN THE GRAND CANONICAL 
ENSEMBLE 

A. The technique 

As an alternative to canonical simulations, it is possible 
to use the grand canonical ensemble (at first, we consider 
a system with only one species of reactant). Instead of 
fixing the number of particles absolutely, one simulates a 
system in a volume v such that configurations containing 
n particles are sampled with the relative probability^ 



cP»"Z(n) 
P(n) oc : , 

77 ' 



(19) 



where Z(n) is the partition function of an n-particle sys- 
tem in volume v, calculated using distinguishable statis- 
tics, and the n! accounts for distinguishability. /i is the 
chemical potential of the monomers, which regulates the 
average concentration. 

Restricting ourselves initially to one species of reac- 
tant, the set of numbers {j}, which identifies a cluster, 
is reduced to a single integer (j), the cluster size. Wc 
retain the brackets for consistency with earlier notation. 
A macrostate {77} is then observed in a simulation with 
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where Zy) is the partition function of a cluster of size 
j in the volume v, as before. The result follows from 
multiplying individual distinguishable partition functions 
Ztj) together, including combinatorial factors to account 
for exchange of particles and multiplying by e' 9 ^™, with 
n being the total number of particles in {77}. 

In principle, small grand canonical simulations are ca- 
pable of capturing the concentration fluctuations high- 
lighted in Section fl] We will first demonstrate that the 
average concentrations found in a small volume in the 
grand canonical ensemble are identical to those in a bulk 
system with the same monomer chemical potential. The 
bulk equilibrium yields of clusters can be expressed in 
terms of the chemical potential of the monomers: Equa- 
tion [T] implies 



[(1)] = Zl e^/v, 



and 
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(22) 



follows from combining Equations G2 [TT] and [5TJ We now 
consider the yield of clusters in a small grand canonical 
simulation of a volume v. Let f)(j\ be the average number 
of clusters of type (j) observed during simulation. As 
P({i]}) in Equation |2T)1 factorizes into separate terms for 
each cluster type, the calculation of itq) does not involve 
the properties of clusters other than (j). Therefore, using 
Equation \20\ wc find 
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This equation can be rewritten as 



-j 3 t^ \ E 
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The sum inside the logarithm is actually the series ex- 
pansion of an exponential, allowing the expression to be 
easily evaluated 



1 to = Mot{-f"*Ww) 
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(25) 



Dividing by v to give a concentration shows that Equa- 
tion[5S]is consistent with Equations[5T]and[221 and hence 
that a small grand canonical simulation should provide 
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the same cluster concentrations as a bulk simulation of 
the same system. If desired, measured cluster concentra- 
tions can then be used to evaluate tp^ through Equation 
121 With , model systems can be compared to experi- 
ment under a range of conditions, as discussed in Section 

Eel 



B. Quantification of errors 

Cluster yields can only be estimated accurately, how- 
ever, if multiple large clusters can exist simultaneously 
during the simulation. Here we define 'large' to mean 
any cluster containing more than one particle. When 
assembly is difficult and biased sampling techniques are 
employed, it is often impractical to bias the formation of 
multiple large clusters. Consequently, states with multi- 
ple large clusters are never sampled and errors are intro- 
duced. Here we derive how the observed concentration of 
clusters differs from the true yield if only a single large 
cluster is sampled. 

Let us assume that a simulation samples states that 
contain at most a single cluster of more than one particle. 
In this case, the average number of clusters of size j > 1 
in the simulation volume is given by 



(i) 



^ ( l) ^ e' 3 '' , 



1 



(26) 



The numerator in this expression arises from summing 
Equation [2D] for states that contain one cluster of size j 
and any number of isolated monomers, and the denom- 
inator from summing over all possible states containing 
at most one cluster larger than a single particle. Having 
simplified the fraction, and using Equation[22] we are left 
with 



(i) 
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(27) 



as the concentration for a system restricted to at most 
one non-trivial cluster. We note that this concentration 
is not directly comparable to single-target results in the 
canonical ensemble. As the two are never needed for the 
same simulations, however, the use of the same notation 
should not cause confusion. 

The relative difference between simulation results and 
the true behaviour of the model is easy to quantify: 



[U)} - [(j)](d E fc >i"[(*Q] 
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(28) 



u[(fc)] will grow proportionally with the simulation vol- 
ume. Consequently, the relative errors will initially grow 
linearly with the simulation volume before plateauing in 
the limit oi ^2 k>1 v[(k)] 3> 1 (when the relative error is 
approximately unity). 



C. Correcting for errors 

In this section we show how to extract [{j)] from the 
measured [(j)](i). It can be trivially shown that, under 
our assumptions of ideality, the predicted bulk yield of 
isolated monomers is the same as the single target yield. 
Equation [57] can be rearranged to give j max — 1 linear 
simultaneous equations for the remaining [(j)], 
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where j ma x is the largest cluster considered. These equa- 
tions can then be solved using standard matrix inversion 
techniques. 

In reality, most simulations will not explicitly forbid 
the presence of multiple clusters of more than one par- 
ticle (although this can be done, as in Section UlIFI) . If, 
however, cluster formation involves a significant free en- 
ergy barrier, and only the formation of a single cluster 
is actively biased by the simulation, multiple large clus- 
ters will not be observed. In these cases, Equation [27] 
can be used, but any rare instances where multiple clus- 
ters of more than one particle do occur (for example, two 
dimers) must not be included in estimating [0)](i). 



D. Relevance to previous studies 

To date, grand canonical techniques (and related semi- 
grand canonical approaches) have primarily been used to 
study the formation of micellar structures^ -28 ' 30 ' 62 as 
opposed to monodisperse target structures. There is no 
reason, however, that grand canonical simulations could 
not be used for monodisperse targets, and the results 
presented here can be used equally well for both types of 
assembly I n many cases in the literature, monomer con- 
centrations are assumed to be so low relative to the sim- 
ulation volume that the probability of finding more than 
one cluster in a simulation box is neglected in the analy- 
sis. The number of monomers in a simulation at a given 
instant is then taken as a proxy for cluster size^ 2 - -24 ' 62 
There has also been considerable debate on the details of 
inferring cluster probabilities from simulations in which 
one monomer is fixed at the centre of the simulation vol- 
ume, under this extremely dilute assumption ! 62 ' 73 ' 74 

As the methodology presented here to extract bulk 
yields is so simple, this extremely dilute assumption 
seems unnecessary. The risk is that states which are re- 
ally characteristic of a cluster of size j and another of 
j' are treated as a single cluster of size j + j' . The fre- 
quency of such mis-labclling would tend to increase with 
simulation volume, resulting in quantitative errors. 

It has been argued that the extremely dilute assump- 
tion is valid provided J^k v l(k)] 1, as the probability of 
sampling a state with two actual clusters is much smaller 



than observing either in isolation* 7 - 5 - Unfortunately, this 
is not necessarily the case. For example, consider systems 
in which [(j + 1)] <C [(.])]■ In these cases the probability 
of a volume containing a cluster of size j and an ad- 
ditional isolated particle may be large compared to the 
probability of observing a genuine cluster of size j + 1, 
even if [(j)] and [(1)] are small. There are two cases when 
this is particularly likely to be relevant: 

1. [(2)] <C [(1)] is likely to be true for large assemblies, 
such as micelles, when many particles are needed 
to stabilize a cluster. 

2- [(jo + 1)] ^ [(jo)] wu l be true for monodisperse 
assemblies where the assembly product has a size 
of jo- 

Both of these conditions hold in the system analysed as 
an example in Section [ill Fl 

As with the corrections for small systems in the canon- 
ical ensemble, applying this methodology will not neces- 
sarily give data that seem to match experiments more 
closely. Rather, these corrections allow data from simu- 
lations to be sensibly compared to experiments or theo- 
ries based on bulk properties. Although in many cases 
the effects might be quantitative rather than qualitative, 
given the low computational cost of the correction scheme 
it would seem sensible to apply it. 

E. Implementation issues of the correction scheme 

Are there any drawbacks to using the extrapolation 
method outlined here? Firstly, it relies upon the assump- 
tion that separate clusters can be described as behav- 
ing approximately ideally. Such a problem will always 
arise when bulk physics is extracted from a single self- 
assembling cluster. Furthermore, as discussed in Section 
IIIIF1 this methodology allows the assumption of ideality 
to be checked. 

From the perspective of practical implementation, it is 
necessary to have an algorithm that evaluates the cluster 
distribution in a configuration, which is potentially com- 
putationally costly. By contrast, if the number of parti- 
cles in the system is simply taken as a proxy for cluster 
size, no such calculation is required. To reduce this cost, 
the clustering could be sampled only every f > 1 steps 
of the simulation. If t is similar to the number of steps 
over which energy correlations within the system are lost, 
such a reduction of sampling frequency will have a limited 
effect on simulation accuracy. 

F. Example extrapolation 

As an illustration of inferring bulk yields from small 
grand canonical simulations, we consider the patchy- 
particle model of Wilber et al^ As with the DNA sim- 
ulations in Sections III Fl and III HI we are simply using 
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FIG. 8. Snapshot from a simulation of the patchy particle model of 
Wilber et aim— The depicted state contains two fully formed cubes 
(eight-particle clusters) and five isolated monomers. 



this model as a typical self-assembling system with which 
to demonstrate the application of the methodology out- 
lined in the previous sections. We consider particles with 
a patch number and orientation that favours the forma- 
tion of cubic octamers, as illustrated in Figure |S1 The 
parameters of the model and conditions at which simu- 
lations were performed are given in the supplementary 
material*^ the specific values were chosen to give a rea- 
sonable yield of cubes in a fairly dilute system. 

Initially, simulations were performed in which umbrella 
sampling was used to accelerate the sampling of a sin- 
gle large cluster, and states of the system with more 
than one cluster of multiple particles were explicitly for- 
bidden. More details are provided in the supplemen- 
tary material^ This approach allowed the extrapolation 
methodology of Section UlI CI to be directly applied. The 
yields of various cluster sizes, and the corrections to ac- 
count for multiple large clusters, are given in Table [II] 
In this case, all concentrations (except that of isolated 
monomers) are seen to increase by around 50 % due to 
the extrapolation. An increased concentration from ex- 
trapolation is as would be expected: by not sampling 
states with multiple large clusters, we measure a reduced 
concentration relative to the true result. 

Also shown in Table |TT] are the results of simulations 
in which multiple large clusters were not explicitly for- 
bidden, but which still only used the largest cluster to 
bias the ensemble and accelerate sampling. In this case, 
the yield of smaller multiple-particle clusters (with fewer 
than six particles) is seen to agree well with the extrap- 
olation. Larger clusters, however, do not match the ex- 
trapolation and clusters of eight or nine particles have 
a yield consistent with the one-cluster simulations. This 
result indicates that these simulations failed to sample 
states with multiple clusters of this size, due to the large 
free-energy barrier associated with formation. Accelerat- 
ing sampling using only the size of the largest cluster is 
therefore a poor way to equilibrate such a system. The 
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size j 




v[(j)]^ d 








1 


3.2308(2) 


3.2308(2) 


3.2312(86) 


3.2293(2) 


3.2308(2) 


2 


1.460(8) x 10~ 2 


2.200(2) x 10 -2 


2.192(7) x 10 -2 


2.029(9) x 10~ 2 


2.027(5) x 10" 2 


3 


2.188(12) x 10~ 4 


3.297(8) x 10" 4 


3.273(19) x 10~ 4 


3.044(21) x 10~ 4 


3.038(9) x 10" 4 


4 


1.032(7) x 10" 4 


1.556(9) x 10" 4 


1.591(97) x 10~ 4 


1.431(9) x 10" 4 


1.433(8) x 10" 4 


5 


5.239(43) x 10^ 6 


7.894(52) x 10" 6 


7.89(62) x 10" 6 


7.298(57) x 10~ 6 


7.273(52) x 10" 6 


6 


1.603(16) x 10~ 5 


2.416(21) x 10" 5 


2.01(35) x 10" 5 


2.249(19) x 10~ 5 


2.226(20) x 10" 5 


7 


7.106(53) x 10~ 5 


1.071(8) x 10" 4 


7.320(53) x 10~ 5 


9.94(11) x 10~ 4 


9.865(68) x 10" 4 


8 


0.3212(35) 


0.4843(78) 


0.3244(41) 


0.4371(75) 


0.4461(63)) 


9 


5.360(80) x 10" 5 


8.08(14) x 10" 5 


5.22(10) x 10" 5 


7.31(19) x 10" 5 


7.44(12) x 10" 5 



TABLE II. Yields of various clusters of size j from simulations of a patchy particle model™ ^[(i)]^"" i s the yield from simulations in 

which only a single cluster of more than one particle could form, v [(j')] pred is the extrapolation of that result to the limit of an arbitrary 
number of clusters, performed using Equation 1271 i>[(i)]/ ln \ is the yield in simulations in which any number of clusters was permitted 

to form, but biasing was only applied to the largest cluster, v [(j)]^™ is the yield from simulations which also accurately sampled states 

containing two clusters of more than one particle - this should be compared to the prediction i>[Cj)](2^ d obtained by solving Equation 1341 

given v[(j)]ffi. 



technical difficulty of biasing the formation of an arbi- 
trarily large number of clusters makes the extrapolation 
procedure a useful alternative. 

The accuracy of the extrapolation scheme can be vali- 
dated by considering the formation of two large clusters. 
If up to two clusters of more than one particle are sam- 
pled, the average number of clusters of size j observed in 
the simulation is 



m e to i + 



v[U)] 



(i) 



E 



\^ Z (k) Z (l) 0n{k+l) 

^ 2k\l\ 



E 



^fe)_„2/3 M fe 



k> 



{ 2(fc!) 



(30) 

This expression follows from considering the contribution 
to the partition function of all states with two or fewer 
large clusters. The sum over k =/= I > 1 is a sum over both 
k > 1 and I > 1, with terms when k = I absent. Note 
that factors of i arise to avoid double counting during 
sums, and when more than one cluster of the same size 
is present due to indistinguishability. As in Equation [26} 
the contribution of monomer partition functions cancels, 
and is not included. 

Equation [30] simplifies to 



[(i)]( 2) = 



Ki)] (i + E*>i «[(*)]) 



E*>i «[(*)] + !(£*>! «[(*)]) 



2 ■ 



(31) 



Further simulations under identical conditions to the 
original single-cluster simulations were performed. In 
this case, up to two clusters of more than one particle 
were allowed, and umbrella sampling was used to bias 
the size of the two largest clusters. Further details are 
provided in the supplementary material,— and the re- 
sults are shown in Table [TT1 along with the yield pre- 
dicted by Equation [34] using the f[(j)](™ found in the 



single-cluster simulations. As is evident from Table [TT] 
the extrapolation method agrees extremely well with the 
explicit two-target simulations. This strongly suggests 
that the extrapolation to bulk under these conditions is 
reliable. 

It is possible to detect some very small non-ideal ef- 
fects. Specifically, for a truly ideal system, v[(l)] = 



(i) 



3.2372 for the conditions used here, as de- 



scribed in the supplementary material™ Table [IT] shows 
that v[(l)](™ is smaller than this, presumably due to ex- 
cluded volume effects. This is consistent with the fact 
that allowing a second large cluster, and thereby increas- 
ing excluded volume, suppresses the yield of monomers 
further {v[l]f^ < v[(l)]ffi). These non-ideal effects, 
however, are very small (the concentration of monomers 
is reduced by less than 0.25% relative to the ideal limit 
in a simulation which allows up to two large clusters). 
Furthermore, as higher numbers of large clusters in a 
small volume contribute a limited amount to the parti- 
tion function, this difference will likely remain small. 

In this simulation volume it would be clearly inappro- 
priate to make the approximation that all particles are 
part of the same cluster, as ^ fe [(fc)] > 1 . One could imag- 
ine, however, reducing the volume by ~ 100, in which 
case 5Zt[(/f)] *C 1 (in fact, a volume this small would 
probably lead to percolating clusters, but the point it 
illustrates is generally valid). Even in this limit, how- 
ever, the probability of observing two monomers (~ 10~ 3 ) 
would be larger than observing a genuine dimcr (~ 10 -4 ), 
and the probability of observing an 8-particle cluster 
and a monomer (~ 10~ 4 ) would be significantly larger 
than a 9-particle cluster (~10~ 7 ). This illustrates that 
EfcK^)] "C 1 is not enough to justify quantitative yields 
being inferred from the very dilute approximation, as 
highlighted in Section MI Dl 
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G. Multi-species clusters 



V. DISCUSSION 



At this stage, only simulations involving one type of 
particle have been considered. The results in this section 
are easily extended to simulations of two or more species, 
each maintained by their own chemical potential. If only 
one cluster of more than one particle is sampled, the anal- 
ogous result to Equation [27] for two species is 



j\k\ e 



1 + y fk4i e /3(Mip+M2 9 ) ' 

1 Z-~ip+q>l ptq\ 



(32) 



where P(j, k) is the probability of observing a cluster 
consisting of j particles of type 1 and k particles of type 
2. Thus the simulation concentration is 



Uk)} 



(i) 



Uk)} 



1 + E P+q >i «[(p> <?)]' 



(33) 



The result is an obvious generalization of the single- 
species case. The approximations can also be checked 
by sampling the formation of two clusters, with a yield 
that is analogous to Equation IM1 



Uk)}(2) = 



Uk)](l+ E v[{p,q)]\ 

V p+q>l / 

i+ e + E "[(P'«)]) 

p+q>l \p+q>l ) 

(34) 



In this paper we have extended the methodology of 
Reference [63| to deal with the inference of bulk properties 
from small simulations of self-assembly involving multi- 
ple particle species and systems in the grand canonical 
ensemble. In general, bulk systems are directly compara- 
ble to experimental studies, but it is often only feasible to 
simulate assembly of a single target. The methods pre- 
sented here can be viewed as a process for inferring stan- 
dard free energies of formation for self-assembling sys- 
tems from small simulations, and checking the accuracy 
of the ideal assumptions underlying that inference. 

For simulations of a single self-assembling cluster in the 
canonical ensemble, large deviations from the bulk yield 
that would be found for the same model are observed due 
to neglected concentration fluctuations. These errors can 
be corrected using the methodology presented here, un- 
der the assumption that separate clusters behave ideally. 
If the formation of two or more clusters can be studied, 
the accuracy of this assumption can be checked by ex- 
amining the convergence on the large system limit. As 
with clusters of one species of particle^ convergence on 
the bulk limit as system size increases can be very slow, 
particularly if one cluster-type dominates the ensemble. 

As a consequence, if quantitative data is to be ex- 
tracted from canonical simulations of a single cluster, 
this methodology (or something equivalent) should be 
applied. A summary of the necessary steps for extrapo- 
lating results to the bulk limit from a single-target canon- 
ical simulation is: 



IV. INFERENCE OF PROPERTIES OTHER THAN 
YIELDS 

This paper has been hitherto devoted to inferring clus- 
ter yields of bulk systems from those found in small sim- 
ulations. Other properties, such as the average potential 
energy of the system, or the frequency of a certain type 
of interaction, may also be of interest. Let us assume 
that we wish to calculate the thermodynamic average of 
a quantity A in bulk, where the tilde indicates that the 
quantity is normalized per particle in the system. Un- 
der the assumptions of the formalism presented here, in 
which separate clusters do not interact, the internal prop- 
erties of a given cluster {i} are identical in a small system 
and in the bulk limit. The relative proportions of differ- 
ent clusters do change, however. To calculate the average 
of A in the bulk limit, therefore, we must measure the 
average for each cluster type in a small simulation, A^y, 
then perform a weighted average using the bulk cluster 
yields inferred via the methods presented in this article. 



A {l} [{i}}i to t 

E{i}[0}]*tot 



(35) 



where i to t — Ej h ^ s ^ nc t°t a l number of particles in a 
cluster. 



• Perform a single-target simulation in a volume v, 
measuring the cluster frequency u[{i}]m. 

• Obtain ipuy by fitting the measured u[{i}]m using 
Equation [T5] 

• Solve for the bulk concentration {{i}] using v and 
ipu\ in Equation [3] whilst fixing the total concen- 
trations using Equation 01 

Furthermore, if a canonical simulation is performed in 
which several clusters can form (d possible clusters in a 
volume dv), the relevance of statistical finite size effects 
can be assessed in the following manner: 

• First, assume the observed cluster distributions are 
reflective of bulk concentrations [{«'}]. 

• Use these [{i}] to estimate the ratios using 
Equations [3] and HI 

• Follow the convergence of cluster yields on the bulk 
limit using Equation 1131 If the yield at size d is 
consistent with the initial results, then the conse- 
quences of simulating a finite number of clusters are 
likely to be negligible. If, however, the results are 
not self-consistent, then finite size effects remain 
significant at a system size d. 
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Extending the analysis of Reference HH to many con- 
stituent species also allows us to treat the special case in 
which one rcactant is localized in space. We find that the 
extrapolation procedure is unchanged if only one species, 
that contributes at most one particle to any cluster, is 
localized. This is particularly relevant to several studies 
involving binding of DNA to strands that are localized 
on a surface, such as References I55U581 None of these 
groups included a finite size correction, and hence the 
quantitative results are not directly applicable to bulk 
systems. We note that the density of adsorbed strands 
on DNA microarray surfaces, a common case in which 
localization is relevant, is usually high enough to cause 
interactions that invalidate the assumptions in this work. 
Su ch p hysics, however, was not explored in References 
|55to8| as only a single target was considered. These in- 
vestigations can therefore only be sensibly compared with 
the low-density limit, for which the finite-size corrections 
presented in this work are appropriate. 

We have also considered simulations in the grand 
canonical ensemble. This technique naturally incorpo- 
rates the concentration fluctuations that were absent in 
canonical systems. Bulk yields, however, can only be ob- 
tained if multiple large clusters can form in the simulation 
volume, a process which may be difficult to sample. We 
have shown how to use the data collected for the assem- 
bly of one cluster to infer the bulk yield under the usual 
ideal approximations, and how to check those approxi- 
mations if two clusters can be simulated. The procedure 
for extrapolating to bulk when only one large cluster can 
be simulated is: 

• Perform a simulation in a volume v in which only a 
single target cluster is sampled. Measure the clus- 
ter frequency ignoring any states that con- 
tain multiple large clusters. 

• The bulk concentration of monomers under these 
conditions is given by the concentration in the 
single-target simulation. To obtain all other con- 
centrations, use the measured to construct 
the matrix Equation [29l and solve for [{i}} by in- 
verting it. 

• If desired, tp^y can be extracted from Equation [3] 
using [{?}], allowing the bulk yield to be estimated 
at other concentrations. 

An alternative technique that has been used in the 
past for grand canonical simulations is to assume that 
any state in which j particles are in the simulation vol- 
ume corresponds to a j-particlc cluster. This assumption 
of extreme dilution simplifies the simulation (there is no 
need to define or measure clusters within the simulation), 
but it can be applied over a much smaller range of con- 
ditions. It has been claimed that it is valid provided 
the total number of particles in a simulation volume is 
small: we have shown, however, that even in this limit 
large quantitative errors can exist. By contrast, the tech- 
nique demonstrated here is valid whenever separate clus- 



ters behave in an approximately ideal fashion, and allows 
the accuracy of the extrapolation to be quantitatively as- 
sessed. 

In addition to the theoretical analysis presented here, 
we have shown examples of typical self-assembling sys- 
tems for which the bulk yield can be accurately inferred. 
We have demonstrated that the corrections are practi- 
cally implcmentable, and that the accuracy of the un- 
derlying assumptions can be reasonably assessed. We 
have also outlined a procedure for inferring bulk proper- 
ties other than the yields of clusters, such as the average 
potential energy, using values measured in single-target 
simulations. 

In some cases the cluster yields obtained using the 
methodology presented here may be described as only 
quantitatively, rather than qualitatively, different from 
the single-target data. Nonetheless, if comparisons of 
models with bulk experiment are to be made, then it is 
sensible to apply these corrections, as failure to do so 
is analogous to reporting results obtained with a faulty 
algorithm that causes a quantitative error. Although 
mcsoscale models are never going to give precise de- 
scriptions of all the properties of a system, many have 
recently been used to provide quantitative comparisons 
of yields with experiment ) 46-i8 i 50- - 52 i 54- - 58 i 6i an( j nence 

should consider the corrections presented here. Other au- 
thors have not compared to experiment, but have related 
equilibrium thermodynamics obtained from single-target 
simulations to bulk simulations of the same mode h 35 i 39 
The effects discussed in this work are relevant to such 
a self-consistent comparison. As computational power 
increases, the simulation of self-assembly for more de- 
tailed models will become possible: to compare these ap- 
proaches with experiment, whether to make predictions 
or validate and parameterize force fields, finite-size cor- 
rections may be relevant. Finally, as the inference of the 
bulk yields generally requires much less effort than per- 
forming the simulations themselves, it seems sensible to 
do it in all cases. 
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Appendix A: Mesoscale models used in the examples 
1. DNA 

The examples of DNA self-assembly presented in 
the main paper involve the coarse-grained model of 
Reference^ 7 -, using its most recent parameterization in 
Reference^. In short, the model treats DNA as a string 
of rigid nucleotides which interact through physically 
motivated pairwise contributions to the energy. The 
rigid nucleotides contain interaction sites to represent the 
sugar-phosphate backbone and the base. Of particular 
importance for our purposes are the hydrogen-bonding 
interactions, which allow base pairs to form between nu- 
cleotides. Bases come in four types: adenine (A), gua- 
nine (G), cytosine (C) and thymine (T). In this model, 
AT and GC can form complementary base pairs through 
hydrogen-bonding interactions. This base pairing leads 
to the formation of double-helical bound states for two 
strands with complementary sequences. 

The short-ranged nature of the interactions in the 
model means that there is a clear distinction between 
bound states of two strands, with a substantial energy of 
interaction, and unbound states, with no interaction en- 
ergy In all cases we consider two strands to be bound if 
there is at least one hydrogen-bonding interaction with 
an energy more negative than — 0.60kcalmol -1 , about 
1/7 of a typical hydrogen-bonding interaction. The re- 
sults presented are not sensitive to the precise value of 
this cutoff, as the coopcrativity of helix formation means 
that the overwhelming majority of bound pairs have well- 
formed duplexes. 



2. Patchy particles 

The demonstration of self-assembly involving cubic oc- 
tamers in Section III.F of the main paper used the patchy 
particle model of Reference^. In this model, particles 
have a number of 'patches' distributed over a spherical 
surface with a symmetry that determines the structure 
of stable clusters. Two particles interact through short- 
ranged repulsion and medium-range attraction, the lat- 
ter being modulated by terms related to the angular and 
torsional alignment of the best-aligned pair of patches on 
the particles. 

Particles with three patches as shown in Figure 




FIG. 9. A patchy particle which tends to form cubic octamcrs. 
The centre of the large sphere represents the particle's centre of 
mass. The smaller spheres representative interactive patches. Note 
that the smaller spheres are simply illustrative of patch location: 
the patches have no actual volume. 



Parameter name 


Value 


e 


1 




1 


2 

ang 


0.2 


Jl 

'-'tor 


0.4 



TABLE III. Specific parameters of the patchy particle model of 
Referenced used in this study, e sets the energy scale of the in- 
teraction between particles, (Tlj the range and o"^ ng and o-^ OI the 
width and torsional tolerance of the patches on the particles. 



have a tendency to form cubic clusters: the inclusion 
of angular and torsional modulation of interactions dis- 
favours alternatives such as large aggregates^. In this 
work we use the parameters given in Table IIIII as con- 
venient choices within the range of values considered in 
Reference^. Note, however, that we use much lower con- 
centrations (by a factor of ~ 250) than typically studied 
in Reference^. This difference ensures that the ideal 
assumptions required for extrapolation remain valid in 
our simulations. At higher concentrations, differences be- 
tween single-target and bulk systems persist but cannot 
be accurately treated within the framework of the main 
article. As pointed out by Wilber et ai, however, the 
concentrations used in Reference^ are artificially high in 
order to accelerate the kinetics of assembly. 

During simulations, particles with an interaction en- 
ergy Eij < — O.le, with e being the energy scale of the in- 
teraction, were considered to be part of the same cluster. 
The results obtained are not sensitive to small changes 
in this value. 



Appendix B: Simulation methods 

1. Metropolis Monte Carlo 

The Metropolis Monte Carlo algorithm (MC)Z£ is a 
widely used method for calculating the thermodynamic 
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properties of a system. From a given initial state, a trial 
move to another state is selected and accepted with a 
probability that ensures the algorithm samples from the 
Boltzmann distribution. The simplest MC algorithms at- 
tempt state changes that involve altering a single object 
within the system - for instance the position or orienta- 
tion of a particle in a molecular simulation. 



2. Virtual Move Monte Carlo 

Simple MC algorithms can face difficulty in reach- 
ing equilibrium if the collective motion of strongly- 
interacting particles is required, as such collective motion 
is slow when only single-particle moves are attempted. 
Algorithms which attempt to move clusters of particles 
can overcome this difficulty: one example is the 'Vir- 
tual Move Monte Carlo' (VMMC) algorithm^, which dy- 
namically generates clusters of particles based on energy 
changes from trial moves. When simulating DNA, we use 
the variant of VMMC in the appendix of Reference^. 



3. Umbrella sampling 

Even with VMMC, self-assembly processes can be slow 
to equilibrate due to high free-energy barriers during for- 
mation. Umbrella sampling^, which involves imposing 
an artificial biasing weight W(r N ) on a system with de- 
grees of freedom r N , can be used to reduce the effec- 
tive height of the barrier. A lower barrier means tran- 
sitions occur more quickly, and equilibration is acceler- 
ated. The thermodynamic expectation of any variable A 
follows from the biased sample obtained as 



(A) = 



(A(r N )/W(r N )) 



ir 



(Bl) 



Here ()w indicates the expectation found 
by sampling from the biased distribution 
W(r N )exp(-/3U(r N )), with U(r N ) being the inter- 
nal energy. 



Appendix C: Simulation details 

1. DNA simulation 

The DNA model was simulated using the VMMC al- 
gorithm, with initial trial moves being either: 

• Rotation of a nucleotide about its backbone site, 
with the axis chosen from a uniform random dis- 
tribution and the angle from a normal distribution 
with mean of zero and a standard deviation of 0.2 
radians. 

• Translation of a nucleotide with the direction cho- 
sen from a uniform random distribution and the 



distance from a normal distribution with mean of 
zero and a standard deviation of 1.7 A. 



a. DNA trimer formation 

The simulation of DNA trimers involved three distinct 
strands: 

1. 5'-GACGACTTAAGGAG-3' 

2. 5'-CTCCTTTTCGACCG-3' 

3. 5'-CGGTCGTTGTCGTC-3' 

Here, the sequence specifies the base of each nucleotide: 
adenine (A), guanine (G), thymine (T) or cytosine (C). 
The 3' and 5' symbols indicate strand directionality. The 
Watson-Crick rules of complementarity, which arc incor- 
porated into the model, dictate that strong bonds can 
only form between AT and CG pairsS. As a result, the 
three strands tend to form three-armed junctions, as each 
strand has two 6-base sections that are complementary 
to 6-base sections on different strands. 

All simulations were performed at 307.7 K, in a sim- 
ulation volume of 1.669 x 10~ 23 m 3 per trimer. 20 and 
40 simulations were performed for the assembly of one 
and two trimers respectively, using 4 x 10 10 attempted 
moves of the VMMC algorithm each. Umbrella sampling 
was used to enhance equilibration. For the single-target 
simulations, the umbrella potential was given by 



W = D x {ai)E x {p)F{n h - n c 



(CI) 



Here Si represents the size of the largest cluster in the 
system (in terms of number of strands), p the smallest 
non-zero number of base pairs between any two strands, 
n c the number of clusters and rib the number of pairs of 
interacting strands. The functional form of D\ is 



10 if si = 1 

1 if Si = 2 

25 if si = 3 



(C2) 



F is given by F(7i& — n c ) = 0.04(™ b ~" c ), and E\ is given 
in Table llVl For two-target simulations, the umbrella po- 
tential was also a function of the size of the second largest 
cluster in the system, S2'- W = D2{s\, S2)E2(p)F(ni, — 
n c ). F has the same definition as for the single-target 
case, E 2 is defined in Table flVl and D 2 in Table IVl 

In order to perform the extrapolation, ipuy must be 
fitted. This fitting was performed by minimizing 



(C3) 



with u[{i}] slm being the measured average number of 
clusters of type {i} in a single-target simulation, and 
being the estimate obtained from Equation 11 of 












P 













1 


2 


3 


4 


5 6 


> 7 


Ei(p) 


1 


2000 


400 


50 


10 


5 3 


1 


E 2 (p) 


1 


1000 


150 


30 


5 


2 3 


1 



TABLE IV. The functions E±(p) and i?2(p) used in the umbrella 
potential for simulations of DNA trimer formation. 



D 2 (si, s 2 ) 


1 2 


3 


si 

4 


5 6 











10 6 


S2 1 


10 0.8 


25 


10 3 


3 x 10 4 


2 


0.2 


15 


10 3 




3 




5 x 10 3 







TABLE V. The function D2(s±, S2) used in the umbrella potential 
for two-target simulations of DNA trimer formation. Values of 
si, S2 with no entry are impossible. 



the main text for a given set of il>{i}- The minimization 
was performed using the Matlab 'fminsearch' function. 

Due to the biasing umbrella potential, different simu- 
lations with the same number of VMMC steps had differ- 
ent overall statistical weight. The data reported in Table 
I of the article therefore involve weighted estimates of 
the mean and standard error, using the "ratio estimate" 
of Cochran^. Extrapolation was performed individually 
for each single-trimer simulation, and averaged using the 
same weighting factors. 

b. DNA duplex with one strand localized 

The simulation of DNA with one localized strand used 
two sequences: 

1. 5'-TTTAGCTCA-3' 

2. 5'-TGAGCT-3' 

Single-target simulations were performed in a cubic peri- 
odic cell of volume 1.669 x 10 - 23 m 3 and at a temperature 
of 300 K. To model a surface to which strands might be 
attached, an infinite energy penalty was imposed upon 
any backbone sites that entered the region \z\ < 8.5 A. 
The longer of the two sequences was attached (via the 
backbone site at the 5' end) to the point (0, A A, 8.5 A) 
by a harmonic spring with a spring constant 0.571 Nm" 1 
and equilibrium length 8.5 A. For two-target simulations, 
a cell of twice the volume was used and the second strand 
of type 1 was attached at (161 A, 161 A, 8.5 A), far enough 
away from the first to avoid any interaction. 

We performed 10 single-target and 20 two-target sim- 
ulations, each consisting of 4 x 10 10 attempted VMMC 
steps. Umbrella sampling was used to accelerate equi- 
libration. In the single-target simulation, the bias used 
was W = Gi(t, c), with t being the total number of bonds 
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Gi(i,c) 





1 2 


t 

3 


4 


5 


6 


7 8 


9 


> 10 





3 


300 100 


20 


3 


1 


1 


1 1 


1 





1 




3.6 x 10 4 100 


20 


3 


1 


1 


1 1 


1 





2 




1800 


20 


3 


1 


1 


1 1 


1 





c 3 






140 


3 


1 


1 


1 1 


1 





4 








10 


1 


1 


1 1 


1 





5 










3 


1 


1 1 


1 





6 












1 


1 1 


1 






TABLE VI. The function G\{t,c) used in the umbrella poten- 
tial for simulations of DNA duplex formation involving a tethered 
particle. Values of c > t arc impossible. 



G 2 (t,c) 





1 2 


t 

3 


4 


5 


6 


7 8 


9 


> 10 





3 


3000 1000 


200 


30 


3 


1 


1 1 


1 





1 




3 x 10 4 1000 


200 


30 


3 


1 


1 1 


1 





2 




2000 


200 


30 


3 


1 


1 1 


1 





c 3 






200 


30 


3 


1 


1 1 


1 





4 








15 


3 


1 


1 1 


1 





5 










3 


1 


1 1 


1 





6 












1 


1 1 


1 






X 



TABLE VII. The function G2(i, c) used in the umbrella potential 
for simulations of two-target DNA duplex formation involving a 
tethered particle. Values of c > t are impossible. 



formed between the free strand and the tethered stand, 
and c being the number of these that are intended to form 
in the final (fully-aligned) structure (mis-aligned bonds 
can form, but only contribute to i, not c). The func- 
tional form of G is given in Table PVTl In the two-target 
simulation, W = G%{t\, ciJG^fe, C2) was used, with ti 
being the total number of bonds between tethered strand 
i and either of the free strands. The functional form of 
G2 is given in Table Ivnl The distribution of clusters 
was recorded at each step. As with the DNA trimers, 
weighted estimates of the mean and standard error of 
inferred yields are reported in the text. 

2. Cubic octamers formed from patchy particles 

Patchy particle simulations were performed using a 
simple MC algorithm, with additional moves for removal 
and addition of particles to make the ensemble grand 
canonical^ 2 -. Specifically, the attempted moves were: 

• Rotation of a particle about its centre, with the axis 
chosen from a uniform random distribution and the 
angle from a normal distribution with mean of zero 
and a standard deviation of 0.2 radians. 

• Translation of a particle with the direction chosen 
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from a uniform random distribution and the dis- 
tance from a normal distribution with mean of zero 
and standard deviation of 0.2ctlj 

• Addition of a particle with a randomly chosen po- 
sition and orientation. 

• Removal of a randomly chosen particle. 

Simulations were performed at a reduced temperature 
of T = 0.08, in a periodic cubic cell of volume 8000<7lj 
and with a chemical potential given by A "^a exp(/?/i) = 
3.2372. Here A is the de Broglie wavelength of the parti- 
cles, and £^Aq 3 is the contribution of the angular and an- 
gular momenta degrees of freedom to the partition func- 
tion of an isolated monomer. Defining (i in this way 
renormalizes it so that the particle masses and moments 
of inertia do not need to be considered. The choice of 
simulation parameters ensured that both monomers and 
octamcrs occurred in the simulation box with reasonable 
frequency. 

Umbrella sampling was used to accelerate equilibra- 
tion. In the case of simulations which allowed a single 
cluster of more than one particle, the umbrella bias was 
given by 

W = U 1 {s 1 )V(s 1 ,e 1 )9(n c -2). (C4) 

In this equation, s\ represents the number of particles 
in the largest cluster, e\ the interaction energy between 



particles in that cluster and n c the number of clusters 
containing more than one particle. The functional form 
of U\ is given in Table I Villi The functional form of V is 

j r / \ I 100 if si = 8 and ei < — 8 1 ,„ rN 
V(s 1 ,e 1 ) = i . \ , (C5) 

1 otherwise 

and the functional form of 9 is 

0(x) = I 1 if X < ° I . (C6) 
I otherwise I 

For the simulations that allowed an arbitrary number 
of clusters to form, but only biased the formation of one 
cluster, the umbrella potential was identical except for 
the 9 term, which was not included. In the two-target 
case, W(y n ) also depended on the size of the second 
largest cluster s 2 , and its energy e 2 . 

W = U 2 (s 1 ,s 2 )V(s 1 ,e 1 )V(s 2 ,e 2 )9(n c - 3). (C7) 

The functional form of U 2 is given in Table I Villi and V 
and 9 are defined as before. 

In the case of single-target simulations, 10 independent 
runs were performed of 10 11 MC steps each. For two- 
target simulations, 20 runs of 10 11 MC steps were used. 
As with the DNA simulations, weighted estimates of the 
mean and standard error of cluster yields were calculated 
by pooling the independent estimates. Matrix inversion 
was performed using the Matlab 'inv' function. 
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Ui(si) 














S l 













1 


2 


3 


4 


5 


6 


7 


8 


> 9 




3 0.5 


15 


10 3 


2 x 10 3 


5 x 10 4 


1.5 x 10 4 


3 x 10 3 


1.5 


1 
















Si 













1 


2 


3 


4 


5 


6 


7 


8 


> 9 





2 


4 


100 


10 4 


4 x 10 4 


8 x 10 5 


10 s 


2 x 10 4 


25 


1 


1 




0.5 


15 


10 3 


2 x 10 3 


5 x 10 4 


1.5 x 10 4 


1.5 x 10 3 


1 


1 


2 






300 


10 4 


2 x 10 4 


5 x 10 5 


1.5 x 10 5 


3 x 10 4 


20 


1 


3 








10 6 


2 x 10 6 


5 x 10 7 


1.5 x 10 7 


1.5 x 10 6 


500 


1 


4 










4 x 10 6 


5 x 10 7 


1.5 x 10 7 


3 x 10 6 


2000 


1 


5 












3 x 10 9 


3 x 10 8 


1 x 10 s 


5 x 10 4 


1 


S2 6 














3 x 10 8 


4 x 10 7 


10 4 


1 


7 
















10 7 


3 x 10 3 


1 


8 


















4 


1 


> 9 




















1 



TABLE VIII. The functions Ui(s\) and Ui(si,S2) used in the umbrella potential for simulations of the cubic octamer formation from 
patchy particles S2 > si is impossible. 



